15 research outputs found

    Acceleration of a Full-scale Industrial CFD Application with OP2

    Batch solution of small PDEs with the OPS DSL

    In this paper we discuss the challenges and optimisation opportunities that arise when solving a large number of small, equally sized discretised PDEs on regular grids. We present an extension of the OPS (Oxford Parallel library for Structured meshes) embedded domain-specific language, showing how support can be added for solving multiple systems and how OPS makes it easy to deploy a variety of transformations and optimisations. The new capabilities in OPS allow data-structure transformations, as well as execution-schedule transformations, to be applied automatically to deliver high performance on a variety of hardware platforms. We evaluate our work on an industrially representative finance simulation on Intel CPUs as well as NVIDIA GPUs.
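
    As a rough illustration of the data-structure transformation the abstract refers to, the sketch below contrasts two layouts for a batch of small, identical 1D problems: a system-major layout, where each system is contiguous, and an interleaved layout, where the same grid point of consecutive systems is contiguous so the innermost loop over systems vectorizes. This is plain C++ written for this summary, not OPS-generated code; the kernel, names and sizes are illustrative.

        // Illustrative sketch (not OPS code): one explicit time step of the 1D heat
        // equation applied to many small, equally sized systems, in two data layouts.
        #include <vector>
        #include <cstddef>

        constexpr std::size_t NSYS = 1024;   // number of independent small systems
        constexpr std::size_t N    = 64;     // grid points per system
        constexpr double r = 0.25;           // diffusion number dt*alpha/dx^2

        // Layout A: system-major, u[s*N + i] -- each system is contiguous.
        void step_system_major(const std::vector<double>& u, std::vector<double>& un) {
            for (std::size_t s = 0; s < NSYS; ++s)
                for (std::size_t i = 1; i + 1 < N; ++i)
                    un[s*N + i] = u[s*N + i]
                                + r * (u[s*N + i - 1] - 2.0*u[s*N + i] + u[s*N + i + 1]);
        }

        // Layout B: interleaved, u[i*NSYS + s] -- point i of every system is contiguous,
        // so the inner loop over s is unit-stride and vectorizes cleanly.
        void step_interleaved(const std::vector<double>& u, std::vector<double>& un) {
            for (std::size_t i = 1; i + 1 < N; ++i)
                for (std::size_t s = 0; s < NSYS; ++s)
                    un[i*NSYS + s] = u[i*NSYS + s]
                                   + r * (u[(i-1)*NSYS + s] - 2.0*u[i*NSYS + s] + u[(i+1)*NSYS + s]);
        }

        int main() {
            std::vector<double> u(NSYS * N, 1.0), un(NSYS * N, 0.0);
            step_system_major(u, un);   // each function assumes its own layout of u;
            step_interleaved(u, un);    // both are called here only to exercise the code
            return 0;
        }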

    Prognostic model to predict postoperative acute kidney injury in patients undergoing major gastrointestinal surgery based on a national prospective observational cohort study.

    Background: Acute illness, existing co-morbidities and the surgical stress response can all contribute to postoperative acute kidney injury (AKI) in patients undergoing major gastrointestinal surgery. The aim of this study was to prospectively develop a pragmatic prognostic model to stratify patients according to their risk of developing AKI after major gastrointestinal surgery. Methods: This prospective multicentre cohort study included consecutive adults undergoing elective or emergency gastrointestinal resection, liver resection or stoma reversal in 2-week blocks over a continuous 3-month period. The primary outcome was the rate of AKI within 7 days of surgery. Bootstrap stability was used to select clinically plausible risk factors for inclusion in the model. Internal model validation was carried out by bootstrap validation. Results: A total of 4544 patients were included across 173 centres in the UK and Ireland. The overall rate of AKI was 14·2 per cent (646 of 4544) and the 30-day mortality rate was 1·8 per cent (84 of 4544). Stage 1 AKI was significantly associated with 30-day mortality (unadjusted odds ratio 7·61, 95 per cent c.i. 4·49 to 12·90; P < 0·001), with increasing odds of death with each AKI stage. Six variables were selected for inclusion in the prognostic model: age, sex, ASA grade, preoperative estimated glomerular filtration rate, planned open surgery, and preoperative use of either an angiotensin-converting enzyme inhibitor or an angiotensin receptor blocker. Internal validation demonstrated good model discrimination (c-statistic 0·65). Discussion: Following major gastrointestinal surgery, AKI occurred in one in seven patients. This preoperative prognostic model identified patients at high risk of postoperative AKI. Validation in an independent data set is required to ensure generalizability.

    Beyond 16GB: Out-of-core stencil computations

    Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel’s Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems three times larger than the on-chip memory capacity with at most a 15% loss in efficiency.
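
    For orientation, the sketch below shows plain spatial cache blocking of a 2D five-point stencil, the basic idea behind the tiling referred to above. The paper's actual scheme additionally tiles across consecutive loops and stages tiles between the large, slow memory and the limited fast memory; none of that machinery appears here, and all names and sizes are illustrative.

        // Illustrative sketch: spatial cache blocking of a 2D 5-point stencil.
        // Tile sizes are chosen so a tile (plus halo) fits in fast memory.
        #include <vector>
        #include <algorithm>
        #include <cstddef>

        constexpr std::size_t NX = 2048, NY = 2048;   // grid size
        constexpr std::size_t TX = 256,  TY = 256;    // tile sizes

        inline std::size_t idx(std::size_t i, std::size_t j) { return j*NX + i; }

        void blocked_stencil(const std::vector<double>& u, std::vector<double>& un) {
            for (std::size_t jt = 1; jt < NY - 1; jt += TY)            // loop over tiles
                for (std::size_t it = 1; it < NX - 1; it += TX)
                    for (std::size_t j = jt; j < std::min(jt + TY, NY - 1); ++j)   // loop inside a tile
                        for (std::size_t i = it; i < std::min(it + TX, NX - 1); ++i)
                            un[idx(i,j)] = 0.25 * (u[idx(i-1,j)] + u[idx(i+1,j)]
                                                 + u[idx(i,j-1)] + u[idx(i,j+1)]);
        }

        int main() {
            std::vector<double> u(NX*NY, 1.0), un(NX*NY, 0.0);
            blocked_stencil(u, un);
            return 0;
        }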

    Auto-vectorizing a large-scale production unstructured-mesh CFD application

    For modern x86-based CPUs with increasingly long vector lengths, achieving good vectorization has become very important for gaining higher performance. Explicit SIMD vector programming techniques have been shown to give near-optimal performance; however, they are difficult to implement for all classes of applications, particularly those with very irregular memory accesses, and they usually require considerable refactoring of the code. Vector intrinsics are also not available in languages such as Fortran, which is still heavily used in large production applications. The alternative is to rely on compiler auto-vectorization, which has usually been less effective in vectorizing codes with irregular memory access patterns. In this paper we present recent research exploring techniques to obtain compiler auto-vectorization for unstructured-mesh applications. A key contribution is a detailed account of the software techniques that achieve auto-vectorization for a large production-grade unstructured-mesh application from the CFD domain, so that it benefits from the vector units on the latest Intel processors without a significant code rewrite. We use code-generation tools in the OP2 domain-specific library to apply the auto-vectorizing optimisations automatically to the production code base, and we further compare the performance of the application with that of other parallelisations, such as on the latest NVIDIA GPUs. We see considerable performance improvements from auto-vectorization: the most compute-intensive parallel loops in the large CFD application show speedups of nearly 40% on a 20-core Intel Haswell system compared with their non-vectorized versions. However, not all loops gain from vectorization; loops with lower computational intensity lose performance due to the associated overheads.
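
    The gather-compute-scatter restructuring that typically enables compiler auto-vectorization of indirect loops can be sketched as follows. This is hand-written illustrative C++, not the code OP2 generates; the kernel, mapping and names are made up, and the scatter is kept scalar because indirect writes may conflict.

        // Illustrative sketch of gather--compute--scatter for an indirect edge loop.
        #include <vector>
        #include <cstddef>

        constexpr int VEC = 8;   // number of edges worked on per block

        // Per-edge kernel on locally gathered data: no indirection inside, so the
        // compiler can vectorize the surrounding simd loop.
        inline void edge_kernel(double xl, double xr, double& rl, double& rr) {
            const double flux = 0.5 * (xl - xr);
            rl -= flux;
            rr += flux;
        }

        void residual(const std::vector<int>& edge_to_node,   // 2 node indices per edge
                      const std::vector<double>& x,           // per-node state
                      std::vector<double>& res) {             // per-node residual
            const std::size_t nedges = edge_to_node.size() / 2;
            for (std::size_t base = 0; base + VEC <= nedges; base += VEC) {
                double xl[VEC], xr[VEC], rl[VEC] = {0}, rr[VEC] = {0};
                for (int k = 0; k < VEC; ++k) {               // gather
                    xl[k] = x[edge_to_node[2*(base+k)]];
                    xr[k] = x[edge_to_node[2*(base+k)+1]];
                }
                #pragma omp simd
                for (int k = 0; k < VEC; ++k)                 // vectorized compute
                    edge_kernel(xl[k], xr[k], rl[k], rr[k]);
                for (int k = 0; k < VEC; ++k) {               // scalar scatter: indirect writes may conflict
                    res[edge_to_node[2*(base+k)]]   += rl[k];
                    res[edge_to_node[2*(base+k)+1]] += rr[k];
                }
            }
            // remainder edges omitted for brevity
        }

        int main() {
            std::vector<int> e2n = {0,1, 1,2, 2,3, 3,4, 4,5, 5,6, 6,7, 7,8};  // 8 edges on a chain of 9 nodes
            std::vector<double> x = {1, 2, 3, 4, 5, 6, 7, 8, 9}, res(9, 0.0);
            residual(e2n, x, res);
            return 0;
        }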

    Loop tiling in large-scale stencil codes at run-time with OPS

    The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers, because there are many options, execution paths and data items per grid point, many of them dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data-locality-improving optimisation called iteration space slicing for use in large OPS applications, on both shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16 GB, and we present scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
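
    A minimal sketch of the delayed-execution idea follows: loops are queued rather than run immediately, and the queue is then executed tile by tile so that each tile's data is reused across all queued loops. The real OPS scheme computes per-loop, per-tile index ranges from the declared stencil accesses (iteration space slicing); that dependence analysis is omitted here, and the two example loops are point-wise so the simple tiling shown is valid. All names are illustrative.

        // Illustrative sketch of delayed ("lazy") execution with tile-by-tile replay.
        #include <functional>
        #include <vector>
        #include <algorithm>
        #include <cstddef>

        struct LazyQueue {
            using Loop = std::function<void(std::size_t, std::size_t)>;  // body over [begin, end)
            std::size_t range = 0;
            std::vector<Loop> loops;

            void enqueue(std::size_t n, Loop body) {
                range = std::max(range, n);
                loops.push_back(std::move(body));      // record the loop instead of running it
            }

            void execute(std::size_t tile) {           // run every queued loop on one tile at a time
                for (std::size_t t0 = 0; t0 < range; t0 += tile) {
                    const std::size_t t1 = std::min(t0 + tile, range);
                    for (auto& loop : loops) loop(t0, t1);
                }
                loops.clear();
            }
        };

        int main() {
            constexpr std::size_t N = 1 << 20;
            std::vector<double> a(N, 1.0), b(N, 2.0);
            LazyQueue q;
            q.enqueue(N, [&](std::size_t lo, std::size_t hi) { for (auto i = lo; i < hi; ++i) a[i] += b[i]; });
            q.enqueue(N, [&](std::size_t lo, std::size_t hi) { for (auto i = lo; i < hi; ++i) b[i] *= a[i]; });
            q.execute(1 << 14);   // both loops sweep each 16k-element tile while it is still in cache
            return 0;
        }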

    Design and Performance of the OP2 Library for Unstructured Mesh Applications.

    OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance by re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the OP2 code generation and compiler framework which, given an application written using the OP2 API, generates efficient code for state-of-the-art hardware (e.g. GPUs and multi-core CPUs). Through a representative unstructured mesh application we demonstrate the ability of the compiler framework to utilize the same OP2 hardware-specific run-time support functionalities across back-ends. Performance results show that the impact of this sharing of basic functionalities is negligible.
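
    The "active library" separation described above can be pictured with a few lines of plain C++: the application developer writes an elemental kernel, and the library owns the loop over the mesh set and the indirection, so the same kernel could be re-targeted to different back-ends. This is a conceptual sketch only; the real OP2 API (op_par_loop and friends) and the code it generates look different.

        // Conceptual sketch of the active-library idea (not the OP2 API).
        #include <vector>
        #include <cstddef>

        struct Mesh {
            std::size_t ncells = 0, nedges = 0;
            std::vector<int> edge_to_cell;     // 2 cell indices per edge
        };

        // Elemental kernel written by the application developer: per-edge physics only,
        // with no knowledge of parallelisation or data layout.
        inline void flux_kernel(const double* qL, const double* qR, double* resL, double* resR) {
            const double f = 0.5 * (*qL - *qR);
            *resL -= f;
            *resR += f;
        }

        // "Library" side: the loop over the edge set and the indirection live here,
        // so the same kernel could be re-targeted to OpenMP, CUDA, MPI, ...
        template <typename Kernel>
        void par_loop_over_edges(const Mesh& m, Kernel k,
                                 const std::vector<double>& q, std::vector<double>& res) {
            for (std::size_t e = 0; e < m.nedges; ++e) {
                const int cL = m.edge_to_cell[2*e], cR = m.edge_to_cell[2*e + 1];
                k(&q[cL], &q[cR], &res[cL], &res[cR]);
            }
        }

        int main() {
            Mesh m;
            m.ncells = 4; m.nedges = 3;
            m.edge_to_cell = {0,1, 1,2, 2,3};
            std::vector<double> q = {1.0, 2.0, 3.0, 4.0}, res(4, 0.0);
            par_loop_over_edges(m, flux_kernel, q, res);
            return 0;
        }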

    Designing OP2 for GPU architectures

    OP2 is an "active" library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In particular, we focus on some of the software architecture design choices and low-level optimizations that maximize performance on NVIDIA's Fermi-architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrially representative CFD application developed with the OP2 API. Results show that, for each system, a number of key configuration parameters need to be set carefully in order to obtain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters and their limitations, and gain insights into optimizations for improved performance.
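
    The auto-tuning mentioned above can be pictured as a brute-force parameter sweep: time the kernel for each candidate configuration and keep the fastest. The sketch below does this in plain C++ over a stand-in block-size parameter; the actual framework tunes CUDA launch parameters such as thread-block and partition sizes, which are not modelled here.

        // Illustrative sketch of a brute-force auto-tuning sweep over one parameter.
        #include <chrono>
        #include <vector>
        #include <algorithm>
        #include <cstdio>
        #include <cstddef>

        static void kernel_with_block_size(std::vector<double>& a, std::size_t block) {
            for (std::size_t b = 0; b < a.size(); b += block)
                for (std::size_t i = b; i < std::min(b + block, a.size()); ++i)
                    a[i] = 2.0 * a[i] + 1.0;
        }

        int main() {
            std::vector<double> a(1 << 22, 1.0);
            const std::size_t candidates[] = {64, 128, 256, 512, 1024};
            std::size_t best = 0;
            double best_ms = 1e30;
            for (std::size_t block : candidates) {
                const auto t0 = std::chrono::steady_clock::now();
                kernel_with_block_size(a, block);
                const auto t1 = std::chrono::steady_clock::now();
                const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
                if (ms < best_ms) { best_ms = ms; best = block; }
                std::printf("block %zu: %.3f ms\n", block, ms);
            }
            std::printf("best block size: %zu\n", best);
            return 0;
        }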

    Block structured compressible Navier-Stokes solution using the OPS high-level abstraction

    In this paper we report the development and validation of a compressible solver with shock capturing, using a domain-specific high-level abstraction framework, OPS, that is being developed at the University of Oxford. OPS uses an active library approach for block-structured meshes and is capable of generating code for a variety of parallel implementations with different parallelisation strategies. Performance results on various architectures are reported for the 1D Shu-Osher test case.
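
    For reference, the sketch below sets up the initial condition commonly used for the 1D Shu-Osher test (a Mach-3 shock running into a sinusoidally perturbed density field). The constants are the values usually quoted for this problem and are given for orientation only; they are not taken from the paper, and the code is plain C++ rather than OPS.

        // Illustrative sketch: commonly quoted Shu-Osher initial condition on x in [-5, 5].
        #include <vector>
        #include <cmath>
        #include <cstddef>

        struct State { double rho, u, p; };

        std::vector<State> shu_osher_initial(std::size_t n) {
            std::vector<State> q(n);
            const double x0 = -5.0, x1 = 5.0, dx = (x1 - x0) / n;
            for (std::size_t i = 0; i < n; ++i) {
                const double x = x0 + (i + 0.5) * dx;
                if (x < -4.0) q[i] = {3.857143, 2.629369, 10.33333};          // post-shock state
                else          q[i] = {1.0 + 0.2 * std::sin(5.0 * x), 0.0, 1.0};  // perturbed quiescent gas
            }
            return q;
        }

        int main() {
            auto q = shu_osher_initial(400);
            (void)q;
            return 0;
        }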